Predicting NFL Running Back Performance

Giovanni Lunetta and Sam Lutzel

Outline

  1. Abstract

  2. Goal/Motivation

  3. Methods

  4. Data

  5. Preliminary Data Exploration

Abstract

This study employs an ensemble of machine learning techniques, including Multiple Linear Regression (MLR), Generalized Linear Models (GLM), and XGBoost to enhance predictive analytics in the National Football League (NFL). The research focuses on one of the most critical aspects of football offense - the running game. In order to predict the yards gained by NFL running backs following a handoff, the study analyzes a plethora of player tracking data from the 2017 to 2019 seasons. It integrates a variety of factors such as player positions, orientation, and the situational context of the game, including down, distance, and field position. This multifaceted approach aims to yield highly accurate models that can serve as valuable tools for coaching staffs to optimize play-calling, manage player workloads, and enhance game planning. The predictive insights derived from this research are intended to support teams in deploying their running backs more effectively, leading to potentially improved outcomes on the field. As the NFL continues to evolve with a greater emphasis on analytics, this study seeks to contribute significantly to the field of sports analytics by providing a model that underscores the importance of the running game in a predominantly pass-oriented league.

Goal/Motivation

The primary goal of this research is to develop cutting-edge predictive models capable of accurately estimating the yards a running back will gain after a handoff during NFL games. This objective is pivotal for formulating advanced offensive strategies and refining player evaluations. The running play is a fundamental aspect of the game that can dictate the tempo, control the clock, and establish physical dominance.

The motivation behind this study stems from the transformative impact that data analytics has had on sports, particularly in the NFL, where the fusion of technology and sports science has begun to redefine how the game is played and understood. In an era where marginal gains are increasingly sought after, the ability to predict the outcome of running plays with high precision can provide teams with a significant competitive advantage. It enables coaches to make informed decisions regarding play selection, player rotations, and game management, especially in critical moments of a match. Additionally, the insights from this study can empower front offices in their scouting and drafting processes by quantifying the expected value a running back adds to their team. Furthermore, there is also a tremendous opportunity to leverage these predictive models on the defensive end of the ball, allowing teams to better anticipate and defend against running plays. With better insights into the opponents running game, defenses can adjust their schemes and personnel to counter the opposing team’s offensive strategy, such as by stacking the box or blitzing the quarterback. In addition to the benefits performance analytics provides to the teams, it also helps fans select better fantasy football teams and make more informed betting decisions. This leads to a more engaging and enjoyable experience for the fans, which is critical for the long-term success of the league.

Ultimately, the true beauty of this research lies in its ability to bridge the gap between complex player tracking data and practical on-field strategies. In specific, it’s about enhancing the very essence of the game and enriching the broader discourse on sports performance analytics. By pioneering research in this domain, this study is set to propel the analytical capabilities of NFL teams to new heights, providing them with tools that were once considered unimaginable. Ultimately, it contributes to the ongoing evolution of the sport itself, marking a pivotal moment in the history of football and the broader world of sports analytics.

Methods

Data Exploration

Prior to developing our predictive models, we will conduct a comprehensive exploratory data analysis (EDA) to understand the underlying structure and characteristics of the data. This step will involve visualizing the distribution of key variables, identifying patterns and outliers, and exploring correlations and interactions between predictors. Data visualization, through techniques like scatter plots, histograms, and heatmaps, can offer an intuitive understanding of data distributions, correlations, and potential clusters within the dataset. EDA will inform our feature selection and engineering strategies by highlighting potential predictors that are most informative of yards gained after a handoff.

Data Cleaning and Preprocessing

Data cleaning and preprocessing are critical to ensuring the quality and integrity of our models. We will clean the dataset by addressing any inconsistencies, handling missing data through imputation techniques, and removing or correcting outliers. Preprocessing steps will include encoding categorical variables using one-hot encoding or label encoding methods, scaling and normalizing numerical variables to ensure they are on comparable scales, and potentially transforming variables to better meet the assumptions of our models. In addition, it may be necessary to remove certain variables that may not be relevant to the target variable or may introduce multicollinearity issues.

Feature Engineering

During preprocessing, we will also undertake feature engineering to create new variables that may have a stronger relationship with the target variable. This could include interaction terms that capture the combined effect of two predictors, polynomial features for capturing non-linear relationships, and domain-specific features that encapsulate strategic elements of the game.

Model Development and Evaluation

With a clean and prepared dataset, we will then proceed to develop our MLR, GLM, and XGBoost models.

Multiple Linear Regression (MLR)

Multiple linear regression models predict a continuous repsonse variable using a linear combination of predictors. For our baseline MLR model, we will use the ordinary least squares (OLS) method to estimate the coefficients of our predictor variables. The model is specified as:

\(Y_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \ldots + \beta_pX_{ip} + \epsilon_i\)

where \(Y_i\) represents the yards gained after the handoff for the \(i^{th}\) observation, \(X_{ip}\) is the \(i^{th}\) observation on the \(j^{th}\) predictor variable (where \(j = 1, ..., p\)), \(\beta_p\) is the \(j^{th}\) coefficient to be estimated corresponding to the proper \(X_{ip}\) variable, and \(\epsilon_i\) is the error term. The \(j^{th}\) coefficient, \(\beta_j\), represents the change in the response variable for a unit change in the \(X_{ip}\) predictor variable, while holding all other predictors constant.

The ordinary least squares (OLS) method will be deployed to minimize the sum of the squared differences between the observed and predicted values. The optimization problem can be represented as minimizing the sum of errors squared:

\[ \Sigma_{i=0}^n \epsilon_i^2 = \Sigma_{i=0}^n (Y_i - \hat{Y}_i)^2 = \Sigma_{i=0}^n (Y_i - \beta_0 - \beta_1X_{i1} - \ldots - \beta_pX_{ip})^2 \]

Here, \(\Sigma_{i=0}^n \epsilon_i^2\) represents the sum of the squared residuals, which we aim to minimize. The variable \(\hat{Y}_i\) represents the predicted outcome of the model. This approach assumes that the relationship between the independent variables and the dependent variable is linear. To ensure the robustness of our model, we will conduct a series of diagnostic tests:

  1. Linearity: We will use scatter plots and residual plots to verify that the relationship between the predictors and the response is linear.
  2. Homoscedasticity: We will inspect the residuals to confirm constant variance across all levels of the independent variables. This can be assessed visually using a residual vs. fitted values plot.
  3. Independence: The Durbin-Watson test will help in detecting the presence of autocorrelation in the residuals, which should not be present in the data.
  4. Normality of Residuals: Normality will be checked using Q-Q plots and statistical tests like the Shapiro-Wilk test.

If any assumptions are violated, we may consider transformations of variables or the use of robust regression techniques.

Generalized Linear Models (GLM)

GLMs extend the linear model framework to allow for response variables that have error distribution models other than a normal distribution. They are particularly useful when dealing with non-normal response variables, such as count data or binary outcomes. In its general form, a GLM consists of three elements:

  1. Random Component: Specifies the probability distribution of the response variable \(Y\), such as normal, binomial, Poisson, or exponential.
  2. Systematic Component: A linear predictor \(\eta = X\beta\).
  3. Link Function: A function \(g\) that relates the mean of the response variable \(E(Y)\) to the linear predictor \(\eta\).

The choice of link function is crucial and is typically selected based on the nature of the distribution of the response variable. For instance, a logit link function is used for a binomial distribution, and a log link function is often used for a Poisson distribution.

The likelihood function for a GLM can be written as:

\[ L(\beta) = \prod_{i=1}^{n} f(y_i; \theta_i, \phi) \]

where \(f(y_i; \theta_i, \phi)\) is the probability function for the \(i\)-th observation, \(\theta_i\) is the parameter of interest (e.g., mean), and \(\phi\) is the dispersion parameter. The goal is to find the values of \(\beta\) that maximize this likelihood function.

XGBoost

XGBoost stands for eXtreme Gradient Boosting and is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. It is a powerful technique that can handle a variety of regression and classification problems. For regression, it can be configured to optimize for different loss functions; the most common for regression being the squared error loss:

\[ L(\theta) = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \]

where \(y_i\) are the observed values, and \(\hat{y}_i\)

In XGBoost, each new tree is built to correct the errors made by the previous ones. The algorithm combines weak predictive models to form a strong predictor. The model’s complexity is controlled by the regularization term \(\Omega(\theta)\) which is a function of the tree structure and the number of leaves. The overall objective function to be minimized is:

\[ \text{Obj}(\theta) = L(\theta) + \lambda \sum_{k}(w_k^2) + \gamma T \]

where \(w_k\) represents the leaf weights of the trees, \(T\) is the number of leaves, \(\lambda\) is the L2 regularization term on the weights, and \(Y\) is the complexity control on the number of leaves. For regression tasks, we can also utilize the quantile loss which is particularly useful for prediction intervals:

\[ L_{\tau}(\theta) = \sum_{i=1}^{n} \left[ \tau (y_i - \hat{y}_i) \mathbb{1}_{y_i \geq \hat{y}_i} + (1 - \tau) (\hat{y}_i - y_i) \mathbb{1}_{y_i < \hat{y}_i} \right] \]

Here, 1 is an indicator function, and \(\tau\) is the quantile of interest, allowing us to model different parts of the conditional distribution of the response variable.

XGBoost also provides a feature importance score, which is a metric that quantifies the contribution of each feature to the model’s predictive power. This is done by measuring the impact on the model’s accuracy each time a feature is used to split the data across all trees.

Model Performance Metrics

For MLR and GLM, the model’s performance will be evaluated using metrics such as the Mean Squared Error (MSE) and the Mean Absolute Error (MAE). For the XGBoost model, along with MSE and MAE, we will assess performance using additional metrics like the R-squared for regression tasks and feature importance scores to understand which variables are most predictive.

Model Interpretation and Application

The final step will be to interpret the models in the context of NFL games. This will involve translating the statistical outputs into actionable insights for coaches and team strategists, providing recommendations on how to leverage the results for competitive advantage in play-calling and player evaluation.

Software and Tools

All analyses will be conducted in R, a statistical computing language that provides a wide array of packages for machine learning and data analysis. For MLR and GLM, we will utilize the stats package that comes with the R base installation. For our XGBoost model, we will use the xgboost package, which is specifically designed for speed and performance. Data manipulation and cleansing will be managed with packages like dplyr and tidyr, while ggplot2 will be employed for data visualization to facilitate understanding and interpretation of the data and model outputs. For feature engineering and preprocessing, we will take advantage of caret or recipes. Hyperparameter tuning can be optimized using the tune package, and for cross-validation, the rsample package will be employed. The broom and modelr packages will be useful for tidying model outputs and working with models in a pipeline, respectively. This suite of packages will enable a comprehensive workflow within R for developing, evaluating, and interpreting the predictive models.

Data

The data is from NFL games played during the 2017, 2018, and the first half of the 2019 seasons.

Each row in the file corresponds to a single player’s involvement in a single play. All the columns are contained in one large dataframe which is grouped and provided by PlayId.

Here are the columns:

GameId - a unique game identifier

PlayId - a unique play identifier

Team - home or away

X - player position along the long axis of the field. See figure below.

Y - player position along the short axis of the field. See figure below.

S - speed in yards/second

A - acceleration in yards/second^2

Dis - distance traveled from prior time point, in yards

Orientation - orientation of player (deg)

Dir - angle of player motion (deg)

NflId - a unique identifier of the player

DisplayName - player’s name

JerseyNumber - jersey number

Season - year of the season

YardLine - the yard line of the line of scrimmage

Quarter - game quarter (1-5, 5 == overtime)

GameClock - time on the game clock

PossessionTeam - team with possession

Down - the down (1-4)

Distance - yards needed for a first down

FieldPosition - which side of the field the play is happening on

HomeScoreBeforePlay - home team score before play started

VisitorScoreBeforePlay - visitor team score before play started

NflIdRusher - the NflId of the rushing player

OffenseFormation - offense formation

OffensePersonnel - offensive team positional grouping

DefendersInTheBox - number of defenders lined up near the line of scrimmage, spanning the width of the offensive line

DefensePersonnel - defensive team positional grouping

PlayDirection - direction the play is headed

TimeHandoff - UTC time of the handoff

TimeSnap - UTC time of the snap

Yards - the yardage gained on the play (you are predicting this)

PlayerHeight - player height (ft-in)

PlayerWeight - player weight (lbs)

PlayerBirthDate - birth date (mm/dd/yyyy)

PlayerCollegeName - where the player attended college

Position - the player’s position (the specific role on the field that they typically play)

HomeTeamAbbr - home team abbreviation

VisitorTeamAbbr - visitor team abbreviation

Week - week into the season

Stadium - stadium where the game is being played

Location - city where the game is being played

StadiumType - description of the stadium environment

Turf - description of the field surface

GameWeather - description of the game weather

Temperature - temperature (deg F)

Humidity - humidity

WindSpeed - wind speed in miles/hour

WindDirection - wind direction

There are 682154 rows and 49 columns.

Preliminary Data Exploration

Lets start by loading the data from the previous folder.

Lets start making some plots to see if we can find any interesting patterns in the data. We will start by looking at the distribution of the target variable, Yards.

As we can see from the plot above, the distribution of yards gained after a handoff is skewed to the right. This is expected since the majority of running plays result in short gains, while a small percentage of plays result in large gains.

Lets do distribution of the number of players in the box.

Based on the plot above, we can see that the number of defenders in the box is normally distributed, with a mean of 6.9 and a standard deviation of 1.6. This is expected since the defense typically lines up with 7 players in the box, but this can vary depending on the offensive formation and the down and distance.

Lets do distribution of the down.

Based on the plot above, we can see that the majority of plays occur on first down, followed by second down, and then third down. This is expected since the offense has four downs to gain 10 yards, so the majority of plays occur on first and second down. In addition, in the case where it is fourth down, you expect there to be the least number of plays since the offense will typically punt the ball or attempt a field goal.

Lets do distribution of the distance to the first down.

This histogram is very interesting. To begin, we see the majority of run plays occur when the distance to the first down is 10 yards. This is because 10 yards is the most common distance to the first down (every drive starts with first and 10 and every play after obtaining a first down starts with first and 10). We also see a spike in the distribution at 1 yard. This is because the offense will typically run the ball on second/third and short (1-3 yards) to try and pick up the first down.

Lets do distribution of YardLine.

Again, like the prior histogram, we have some interesting characteristics and spikes. First, we see a spike at the 25 yard line. This is because the 25 yard line is the touchback line, so if the offense does not run out of the endzone on a kickoff, the ball is automatically placed at the 25 yard line. We also see a spike at the 1 yard line. This is because the offense will typically run the ball that close to the goal line to try and score a touchdown.